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1. Introduction 


Continuous treatments or exposures (such as dose, duration, and frequency) arise very often in practice, es¬ 
pecially in observational studies. Importantly, such treatments lead to effects that are naturally described by 
curves (e.g., dose-response curves) rather than scalars, as might be the case for binary treatments. Two major 
methodological challenges in continuous treatment settings are (1) to allow for flexible estimation of the dose- 
response curve (for example to discover underlying structure without imposing a priori shape restrictions), 
and (2) to properly adjust for high-dimensional confounders (i.e., pre-treatment covariates related to treatment 
assignment and outcome). 

Consider a recent example involving the Hospital Readmissions Reduction Program, instituted by the Cen¬ 
ters for Medicare & Medicaid Services in 2012, which aimed to reduce preventable hospital readmissions by 


penalizing hospitals with excess readmissions. McHugh et al. (20131 were interested in whether nurse staffing 
(measured in nurse hours per patient day) affected hospitals’ risk of excess readmissions penalty. The left panel 
of Figure[T]shows data for 2976 hospitals, with nurse staffing (the ‘treatment’) on the x-axis, whether each hos¬ 
pital was penalized (the outcome) on the y-axis, and a loess curve fit to the data (without any adjustment). One 
way to characterize effects is to imagine setting all hospitals’ nurse staffing to the same level, and seeing if 
changes in this level yield changes in excess readmissions risk. Such questions cannot be answered by simply 
comparing hospitals’ risk of penalty across levels of nurse staffing, since hospitals differ in many important 
ways that could be related to both nurse staffing and excess readmissions (e.g., size, location, teaching status, 
among many other factors). The right panel of Figure[T]displays the extent of these hospital differences, show¬ 
ing for example that hospitals with more nurse staffing are also more likely to be high-technology hospitals 
and see patients with higher socioeconomic status. To correctly estimate the effect curve, and fairly compare 
the risk of readmissions penalty at different nurse staffing levels, one must adjust for hospital characteristics 
appropriately. 

In practice, the most common approach for estimating continuous treatment effects is based on regression 


modeling of how the outcome relates to covariates and treatment (e.g., Imbens (2004 1 , Hill (20111). However, 
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Average nursing hours per patient day 



Fig. 1. Left panel: Observed treatment and outcome data with unadjusted loess fit. Right panel: Average 
covariate value as a function of exposure, after transforming to percentiles to display on common scale. 


this approach relies entirely on correct specification of the outcome model, does not incorporate available 
information about the treatment mechanism, and is sensitive to the curse of dimensionality by inheriting the 
rate of convergence of the outcome regression estimator. Hirano and Imbens (20041, Imai and van Dyk| ( |2004| ), 
and Galvao and Wang ( |2015| ) adapted propensity score-based approaches to the continuous treatment setting, 
but these similarly rely on correct specification of at least a model for treatment (e.g., the conditional treatment 
density). 

In contrast, semiparametric doubly robust estimators (Robins and Rotnitzkyj [200 1[ van der Laan and 


Robins! 20031 are based on modeling both the treatment and outcome processes and, remarkably, give consis¬ 


tent estimates of effects as long as one of these two nuisance processes is modeled well enough (not necessarily 
both). Beyond giving two independent chances at consistent estimation, doubly robust methods can also at¬ 
tain faster rates of convergence than their nuisance (i.e., outcome and treatment process) estimators when both 
models are consistently estimated; this makes them less sensitive to the curse of dimensionality and can allow 
for inference even after using flexible machine learning-based adjustment. However, standard semiparametric 
doubly robust methods for dose-response estimation rely on parametric models for the effect curve, either by 
explicitly assuming a parametric dose-response curve (| Robins[ 2000 van der Laan and Robins 20031, or else 
by projecting the true curve onto a parametric working model ( [Neugebauer and van der Laan 20071. Unfor¬ 
tunately, the first approach can lead to substantial bias under model misspecification, and the second can be of 
limited practical use if the working model is far away from the truth. 

Recent work has extended semiparametric doubly robust methods to more complicated nonparametric and 
high-dimensional settings. In a foundational paper, yan der Laan and Dudoit| ( |2003| ) proposed a powerful cross- 
validation framework for estimator selection in general censored data and causal inference problems. Their 
empirical risk minimization approach allows for global nonparametric modeling in general semiparametric 
settings involving complex nuisance parameters. For example, Diaz and van der Laan (20131 considered 
global modeling in the dose-response curve setting, and developed a doubly robust substitution estimator of 
risk. In nonparameric problems it is also important to consider non-global learning methods, e.g., via local and 
penalized modeling (Gyorfi et al.j 20021. Rubin and van der Laan (2005[ 2006a|b I proposed extensions to such 





























































Continuous treatment effects 3 

paradigms in numerous important problems, but the former considered weighted averages of dose-response 
curves and the latter did not consider doubly robust estimation. 

In this paper we present a new approach for causal dose-response estimation that is doubly robust without 
requiring parametric assumptions, and which can naturally incorporate general machine learning methods. The 
approach is motivated by semiparametric theory for a particular stochastic intervention effect and a correspond¬ 
ing doubly robust mapping. Our method has a simple two-stage implementation that is fast and easy to use with 
standard software: in the first stage a pseudo-outcome is constructed based on the doubly robust mapping, and 
in the second stage the pseudo-outcome is regressed on treatment via off-the-shelf nonparametric regression 
and machine learning tools. We provide asymptotic results for a kernel version of our approach under weak 
assumptions, which only require mild smoothness conditions on the effect curve and allow for flexible data- 
adaptive estimation of relevant nuisance functions. We also discuss a simple method for bandwidth selection 
based on cross-validation. The methods are illustrated via simulation, and in the study discussed earlier about 
the effect of hospital nurse staffing on excess readmission penalties. 


2. Background 

2.1. Data and notation 


Suppose we observe an independent and identically distributed sample (Zi, ..., Zn) where Z = (L, A, Y) has 
support Z = {C X Axy). Here L denotes a vector of co variates, A a continuous treatment or exposure, and 


19741, 


Y some outcome of interest. We characterize causal effects using potential outcome notation (Rubin 
and so let denote the potential outcome that would have been observed under treatment level a. 

We denote the distribution of Z by P, with density p(z) = p{y \ 1, a)p{a \ l)p(l) with respect to some 
dominating measure. We let denote the empirical measure so that empirical averages n~^ /(Z,) can 

be written as P„{/(Z)} = f f(z)dFn(z). To simplify the presentation we denote the mean outcome given 
covariates and treatment with ^(1, a) = E(y | L = 1, A = o), denote the conditional treatment density 
given covariates with 7r(a | 1) = ^P(A < a | L = 1), and denote the marginal treatment density with 
~ — ®)- we use ||/|| = {f /(z)^(iP(z)}^/^ to denote the L 2 {P) norm, and we use 

\x = sup 3 ,g_:f |/(x)| to denote the uniform norm of a generic function / over x £ X. 


2.2. Identification 

In this paper our goal is to estimate the effect curve 6{a) = E(y“). Since this quantity is defined in ferms 
of potential outcomes that are not directly observed, we must consider assumptions under which it can be 
expressed in terms of observed data. A full treatment of identification in the presence of continuous random 
variables was given by Gill and Robins (20011; we refer the reader there for details. The assumptions most 
commonly employed for identification are as follows (the following must hold for any o G ^ at which 9{a) is 
to be identified). 


Assumption 1. Consistency: A = a implies Y = Y^. 

Assumption 2. Positivity: 7r(a | 1) > vr^m > Ofor all 1 G C. 

Assumption 3. Ignorability: E(y“ | L, A) = E(y“ | L). 

Assumptions 1-3 can all be satisfied by design in randomized trials, but in observational studies they may be 
violated and are generally untestable. The consistency assumption ensures that potential outcomes are defined 
uniquely by a subject’s own treatment level and not others’ levels (i.e., no interference), and also not by the way 
treatment is administered (i.e., no different versions of treatment). Positivity says that treatment is not assigned 
deterministically, in the sense that every subject has some chance of receiving treatment level o, regardless of 
covariates; this can be a particularly strong assumption with continuous treatments. Ignorability says that the 
mean potential outcome under level a is the same across treatment levels once we condition on covariates (i.e., 
treatment assignment is unrelated to potential outcomes within strata of co variates), and requires sufficiently 
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many relevant covariates to be collected. Using the same logic as with discrete treatments, it is straightforward 
to show that under Assumptions 1-3 the effect curve 9{a) can be identified with observed data as 

0(a) =E{//(L,a)} = y/i(l,a) dP(l). (1) 

Even if we are not willing to rely on Assumptions 1 and 3, it may often still be of interest to estimate 0(a) as 
an adjusted measure of association, defined purely in terms of observed data. 


3. Main results 

In this section we develop doubly robust estimators of the effect curve 0(a) without relying on parametric 
models. First we describe the logic behind our proposed approach, which is based on finding a doubly robust 
mapping whose conditional expectation given treatment equals the effect curve of interest, as long as one of 
two nuisance parameters is correctly specified. To find this mapping, we derive a novel efficient influence 
function for a stochastic intervention parameter. Our proposed method is based on regressing this doubly 
robust mapping on treatment using off-the-shelf nonparametric regression and machine learning methods. We 
derive asymptotic properties for a particular version of this approach based on local-linear kernel smoothing. 
Specifically, we give conditions for consistency and asymptotic normality, and describe how to use cross- 
validation to select the bandwidth parameter in practice. 


3.1. Setup and doubly robust mapping 

If 0(a) is assumed known up to a finite-dimensional parameter, for example 0(a) = 'ipQ+tpia for {'ipQ, ipi) £ , 

then standard semiparametric theory can be used to derive the efficient influence function, from which one can 
obtain the efficiency bound and an efficient estimator ( Bickel et al.[ 1993 van der Laan and Robins 2003| 


Tsiatis 20061. However, such theory is not directly available if we only assume, for example, mild smoothness 
conditions on 0(a) (e.g., differentiability). This is due to the fact that without parametric assumptions 0(o) is 
not pathwise differentiable, and root-n consistent estimators do not exist (Bickel et al. 1993 Dfaz and van der| 
Laan[ 2013| |. In this case there is no developed efficiency theory. 

To derive doubly robust estimators for 0(a) without relying on parametric models, we adapt semiparametric 
theory in a novel way similar to the approach of Rubin and van der Laan| ( |2005 2006a I. Our goal is to find a 
function ^(Z; tt, p) of the observed data Z and nuisance functions (vr, p) such that 


E{^(Z;7r,/r) | A = a} = 0(a) 


if either Ir = tt or JI = p (not necessarily both). Given such a mapping, off-the-shelf nonparametric regression 
and machine learning methods could be used to estimate 0(a) by regressing ^(Z; vr, jj) on treatment A, based 
on estimates vr and (i. 

This doubly robust mapping is intimately related to semiparametric theory and especially the efficient in¬ 
fluence function for a particular parameter. Specifically, if E{^(Z; \ A = a} = 0(a) then it follows that 

E{^(Z;w,/l)} = V'for 

'ip= / p{l,a)w{a)dP{l)da. (2) 

Ja Jc 


This indicates that a natural candidate for the unknown mapping ^(Z; vr, p) would be a component of the effi¬ 
cient influence funclion for the parameter yi, since for regular parameters such as ijj in semi- or non-parametric 
models, the efficient influence function (/>(Z; tt, p) will be doubly robusf in the sense that E{iji(Z; vf, /!)} = 0, 
if either if = tt or J1 = p ( [Robins and Rotnitzky[ [200 1[ [van der Laan and Robins[ [2003 j l. This implies 
E{(/)(Z; TT, p)} = E{^(Z; tt, p) — Tp} =0 so that E{^(Z; Tf,p)} = 'ip if either If = tt or p = p. This kind 
of logic was first used by [Rubin and van der Laan ( [2005 [ [2006a[ l for full data parameters that are functions of 
covariates rather than treatment (i.e., censoring) variables. 
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The parameter if; is also of interest in its own right. In particular, it represents the average outcome under an 
intervention that randomly assigns treatment based on the density ro (i.e., a randomized trial). Thus comparing 
the value of this parameter to the average observed outcome provides a test of treatment effect; if the values 
differ significantly, then there is evidence that the observational treatment mechanism impacts outcomes for at 
least some units. Stochastic interventions were discussed by Dfaz and van der Laan (20121, for example, but 
the efficient influence function for ij: has not been given before under a nonparametric model. Thus in Theorem 
1 below we give the efficient influence function for this parameter respecting the fact that the marginal density 
w is unknown. 


Theorem 1. Under a nonparametric model, the efficient influence function for defined in ([^ is ^(Z; vr, p) — 
a) — p(l, a)dP(l)}-n7(a)da, where 




M A I 1) <iP(l) + M(l, dP(i). 


A proof of Theorem 1 is given in the Appendix (Section 2). Importantly, we also prove that the function 
^(Z; TT,^) satisfies ifs desired double robusfness properly, i.e., fhaf E{^(Z;^, p) \ A = a} = 6{a) if either 
W = TT or JI = p. As mentioned earlier, this motivates estimating the effect curve 6{a) by estimating the 
nuisance functions (vr, ^), and then regressing the estimated pseudo-outcome 


= f 7r(A I 1) dP„(l) + f /i(l, A) dP^l) 

7r(A I L) Jc Jc 

on treatment A using off-the-shelf nonparametric regression or machine learning methods. In the next subsec¬ 
tion we describe our proposed approach in more detail, and analyze the properties of an estimator based on 
kernel estimation. 


3.2. Proposed approach 

In the previous subsection we derived a doubly robust mapping f{Z; vr, p) for which E{^(Z; 7f,Jl)\A = a} = 
9{a) as long as either W = it or ]I = p. This indicates that doubly robust nonparametric estimation of 9{a) 
can proceed with a simple two-step procedure, where both steps can be accomplished with flexible machine 
learning. To summarize, our proposed method is: 

1. Estimate nuisance functions (vr, p) and obtain predicted values. 

2. Construct pseudo-outcome ^(Z; vr, p) and regress on treatment variable A. 


We give sample code implementing the above in the Appendix (Section 9). 

In what follows we present results for an estimator that uses kernel smoothing in Step 2. Such an approach 


is related to kernel approximation of a full-data parameter in censored data settings. Robins and Rotnitzky 


(2001|) gave general discussion and considered density estimation with missing data, while van der Laan and 


Robins (19981, van der Laan and Yu (20011, and van der Vaart and van der Laan (20061 used the approach for 


(20101 used it implicitly for nonparametric regression with missing 


current status survival analysis; Wang et al. 
outcomes. 

As indicated above, however, a wide variety of flexible methods could be used in our Step 2, including 
local partitioning or nearest neighbor estimation, global series or spline methods with complexity penalties, or 
cross-validation-based combinations of methods, e.g.. Super Learner ( van der Laan et al(j 2007| |. In general we 
expect the results we report in this paper to hold for many such methods. To see why, let 9 denote the proposed 
estimator described above (based on some initial nuisance estimators (vf, p) and a particular regression method 
in Step 2), and let 9 denote an estimator based on an oracle version of the pseudo-outcome f{Z;Tf,p) where 
{Tf,p) are the unknown limits to which the estimators (vr,/)) converge. Then ||0 — 0|| < ||0 — 0||-|-||0 — 0||, 
where the second term on the right can be analyzed with standard theory since 0 is a regression of a simple 
fixed function f{Z]W,p) on A, and the first term will be small depending on the convergence rates of vr and p. 
A similar point was discussed by Rubin and van der Laan (2005[ 2006a||. 
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The local linear kernel version of our estimator is 6h{n) = gha{a )^where gha{t) = (1, and 


Phi'^) = argmin 
/3eR2 


Kha{A)^i{Z;^,fl) - ghaiAyp'^ 


(3) 


for Kha{t) = h ^iT{(f—a)//i} with it'a standard kernel function (e.g., a symmetric prohahility density) and/i 
a scalar bandwidth parameter. This is a standard local linear kernel regression of ^(Z; vr, /i) on A. For overviews 
of kernel smoothing see, e.g., Fan and Gijhels ( 1996| ), Wasserman ( |2006 1, and |Li and Racine ( |2007| ). Under 
near-violations of positivity, the above estimator could potentially lie outside the range of possible values for 
6{a) (e.g., if Y is binary); thus we present a targeted minimum loss-based estimator (TMLE) in the Appendix 
(Section 4), which does not have this problem. Alternatively one could project onto a logistic model in (3). 


3.3. Consistency of kernel estimator 

In Theorem 2 below we give conditions under which the proposed kernel estimator 6h{a) is consistent for 6{a), 
and also give the corresponding rate of convergence. In general this result follows if the bandwidth decreases 
with sample size slowly enough, and if either of the nuisance functions vr or /r is estimated well enough (not 
necessarily both). The rate of convergence is a sum of two rates: one from standard nonparametric regression 
problems (depending on the bandwidth h), and another coming from estimation of the nuisance functions tt 
and fj- 

Theorem 2. Let tt and JL denote fixed functions to which vr and converge in the sense that | Ivr — vf 11^ = Op(l) 
and ll/i — /f ||2 = Op(l), and let a ^ A denote a point in the interior of the compact support A of A. Along 
with Assumption 2 (Positivity), assume the following.■ 

(a) Either Tf = tt orjl = p,. 

(b) The bandwidth h = hn satisfies /i —0 and nh^ oo as n ^ oo. 

(c) K is a continuous symmetric probability density with support [—1,1]. 

(d) 9{a) is twice continuously differentiable, and both w{a) and the conditional density o/^(Z;7",/Z) given 
A = a are continuous as functions of a. 

(e) The estimators (vr,/I) and their limits [tt,]!) are contained in uniformly bounded function classes with 
finite uniform entropy integrals (as defined in Section 5 of the Appendix), with l/vr and 1 /vf also uniformly 
bounded. 


Then 


where 


|«„(a)-«(<!)! =0, 


( ' 


\s/rih 


+ h^ + rnia)snia) 


sup ||7r(t I L) — 7r(f | L)|| = Op(r(n)'] 

t\\t—a\<h ^ ^ 

sup \\p(L,t) - p{L,t)\\ = Op(s{n)) 

t\\t—a\<h ^ ^ 

are the ‘local’ rates of convergence offr and p near A = a. 

A proof of Theorem 2 is given in the Appendix (Section 6). The required conditions are all quite weak. 
Condition (a) is arguably the most important of the conditions, and says that at least one of the estimators tt or 
p must be consistent for the true vr or ^ in terms of the uniform norm. Since only one of the nuisance estimators 
is required to be consistent (not both). Theorem 2 shows the double robustness of the proposed estimator 6h{a). 
Conditions (b), (c), and (d) are all common in standard nonparametric regression problems, while condition (e) 
involves the complexity of the estimators tt and p (and their limits), and is a usual minimal regularity condition 
for problems involving nuisance functions. 

Condition (b) says that the bandwidth parameter h decreases with sample size but not too quickly (so 
that nh^ —)• oo). This is a standard requirement in local linear kernel smoothing (Fan and Cijbels 1996 
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Wasserman 


2006 Li and Racine 2007|. Note that since nh = nh^/fi^, it is implied that nh oo; thus 


one can view nh as a kind of effective or local sample size. Roughly speaking, the bandwidth h needs to go 
to zero in order to control bias, while the local sample size nh (and nh^) needs to go to infinity in order to 
control variance. We postpone more detailed discussion of the bandwidth parameter until a later subsection, 
where we detail how it can be chosen in practice using cross-validation. Condition (c) puts some minimal 
restrictions on the kernel function. It is clearly satisfied for most common kernels, including the uniform 
kernel K{u) = I{\u\ < l)/2, the Epanechnikov kernel K{u) = (3/4)(l — u‘^)I{\u\ < 1), and a truncated 
version of the Gaussian kernel iL(?x) = I{\u\ < l)(^(ri)/{2$(l) — 1} with (/> and the density and distribution 
functions for a standard normal random variable. Condition (d) restricts the smoothness of the effect curve 
9{a), the density of w{a), and the conditional density given A = aof the limiting pseudo-outcome ^(Z; 7", ]£). 
These are standard smoothness conditions imposed in nonparametric regression problems. By assuming more 
smoothness of 9{a), bias-reducing (higher-order) kernels could achieve faster rates of convergence and even 
approach the parametric root-n rate (see for example Fan and Gijbels ( |1996 l, Wasserm^ ( 2006| ), and others). 

Condition (e) puts a mild restriction on how flexible fhe nuisance estimators (and fheir corresponding lim¬ 
its) can be, although such uniform entropy conditions still allow for a wide array of data-adaptive estimators 


and, importantly, do not require the use of parametric models. Andrews (1994 1 (Section 4), van der Vaart and 


Wellner| (1996l (Sections 2.6-2.7), and van der Vaart (2000l (Examples 19.6-19.12) discuss a wide variety of 


function classes with finite uniform entropy integrals. Examples include standard parametrie classes of func¬ 
tions indexed by Euelidean parameters (e.g., parametric functions satisfying a Eipschitz condition), smooth 
functions with uniformly bounded partial derivatives, Sobolev classes of functions, as well as convex combi¬ 
nations or Eipschitz transformations of any such sets of funetions. The uniform entropy restrietion in eondition 
(e) is therefore not a very strong restriction in practice; however, it could be further weakened via sample 
splitting techniques (see Chapter 27 of van der Eaan and Rose (20111). 

The convergence rate given in the result of Theorem 2 is a sum of two eomponents. The first, 1 / ^/nh + h?, 
is the rate achieved in standard nonparametrie regression problems without nuisanee funetions. Note that if h 
tends to zero slowly, then \j\fnh will tend to zero quickly but will tend to zero more slowly; similarly if 
h tends to zero quickly, then I? will as well, but will tend to zero more slowly. Balancing these two 

terms requires h ~ n~^l^ so that ~ n~‘^l^. This is the optimal pointwise rate of convergenee 

for standard nonparametrie regression on a single covariate, for a twiee eontinuously differentiable regression 
function. 

The second component, rn(a)s„(a), is the product of the local rates of eonvergence (around A = a) of the 
nuisanee estimators vr and /I towards their targets vr and p. Thus if the nuisance funetion estimates eonverge 
slowly (due to the curse of dimensionality), then the convergence rate of 9h{a) will also be slow. However, 
sinee the term is a product, we have two chanees at obtaining fast convergenee rates, showing the bias-redueing 
benefit of doubly robust estimators. The usual explanation of double robustness is that, even if fi is misspecified 
so that Sn{a) = 0(1), then as long as vf is eonsistent, i.e., rn(a) = o(l), we will still have consistency since 
rnia)sn{a) = o(l). But this idea also extends to settings when both vr and p are consistent. For example 
suppose h ~ so that l/^/nh + h? and suppose vr and ft are locally consistent with rates 

r„(a) = and Sn{a) = n“^Oo_ Then the produet is rn{a)sn{a) = 0(n“^/^) = o{n~ ^O) and the 

eontribution from the nuisanee funetions is asymptotieally negligible, in the sense that the proposed estimator 
has the same eonvergenee rate as an infeasible estimator with known nuisanee funetions. Contrast this with 
non-doubly-robust plug-in estimators whose eonvergenee rate generally matehes that of the nuisanee funetion 
estimator, rather than being faster (van der Vaart 20141. 

In Seetion 8 of the Appendix we give some diseussion of uniform eonsisteney, whieh, along with weak 
eonvergenee, will be pursued in more detail in future work. 


3.4. Asymptotic normality of kernel estimator 

In the next theorem we show that if one or both of the nuisance functions are estimated at fast enough rates, 
then the proposed estimator is asymptotieally normal after appropriate scaling. 
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Theorem 3. Consider the same setting as Theorem 2. Along with Assumption 2 (Positivity) and conditions 
(a)—(e) from Theorem 2, also assume that: 


(f) The local convergence rates satisfy rn{a)sn{a) = Op{l/s/nh). 


Then 


^^{k^a)-f,^a) + K^a)]yNU 

I i \ w[a) 

where bh{a) = 9”{a){hf /2) f u^K(u) du + o{h?), and 


cr^(a) = E 


r^(L, a) + {//(L, a) - //(L, g)}^ 
{^(a I L)/ro(a)}2/{7r(a | 'L)/w{a)} 


|0(a) — m(a)| 


for r^(l, a) = var{Y | L = 1, ^ = a), ^{a) = E{7r(o | L)}, m(o) = E{^(L, a)}. 

The proof of Theorem 3 is given in the Appendix (Section 7). Conditions (a)-(e) are the same as in Theorem 
2 and were discussed earlier. Condition (f) puts a restriction on the local convergence rates of the nuisance 
estimators. This will in general require at least some semiparametric modeling of the nuisance functions. Truly 
nonparametric estimators of vr and p, will typically converge at slow rates due to the curse of dimensionality, 
and will generally not satisfy the rate requirement in the presence of multiple continuous covariates. Condition 
(f) basically ensures that estimation of the nuisance functions is irrelevant asymptotically; depending on the 
specific nuisance estimators used, it could he possible to give weaker but more complicated conditions that 
allow for a non-negligible asymptotic contribution while still yielding asymptotic normality. 

Importantly, the rate of convergence required by condition (g) of Theorem 3 is slower than the root-n rate 
typically required in standard semiparametric settings where the parameter of interest is finite-dimensional 
and Euclidean. For example, in a standard setting where the support A is finite, a sufficient condition for 
yielding the requisite asymptotic negligibility for attaining efficiency is r„(a) = Sn{a) = however 


in our seffing the weaker condition r„(a) = Sn(a) = o{n would be sufficient if h 


~ n 


- 1/5 


. Similarly, 


if one nuisance estimator tt or /i is computed with a correctly specified generalized additive model, then the 
other nuisance estimator would ony need to be consistent (without a rate condition). This is because, under 
regularity c onditions and w ith optimal smoothing, a generalized additive model estimator converges at rate 
(Horowitz 20091, so that if the other nuisance estimator is merely consistent we have rn{a)sn{a) = 
0(n“^/®)o(l) = o(n“^/^), which satisfies condition (f) as long as /i ~ In standard settings such 

flexible nuisance estimation would make a non-negligible contribution to the limiting behavior of the estimator, 
preventing asymptotic normality and root-n consistency. 

Under the assumptions of Theorem 3, the proposed estimator is asymptotically normal after appropriate 
scaling and centering. However, the scaling is by the square root of the local sample size s/nh rather than the 
usual parametric rate This slower convergence rate is a cost of making fewer assumptions (equivalently, 
the cost of better efficiency would be less robusfness); fhus we have a fypical bias-variance frade-off. As in 
sfandard nonparamefric regression, fhe esfimafor is consisfenf buf nol quife centered al 0(o); Ihere is a bias 
term of order 0{h?), denoted hh{a). In facl fhe esfimafor is centered al a smoofhed version of fhe effecl 
curve 0^(a) = + bh{a). This phenomenon is ubiquitous in nonparamefric regression, 

and complicales fhe process of computing confidence bands. If is sometimes assumed fhal fhe bias term is 
o{l/s/^) and fhus asymptotically negligible (e.g., by assuming h = so lhat nh^ —)• 0); fhis is called 

undersmoolhing and lechnically allows for fhe conslrucfion of valid confidence bands around 9{a). However, 
Ihere i s liffle guidance abo uf how fo acfually undersmoolh in pracfice, so if is mosfly a technical device. We 


follow Wasserman (20061 and olhers by expressing uncerlainly abouf fhe esfimafor 9h{a) using confidence 
intervals centered at the smoothed data-dependent parameter 0^(a). For example, under the conditions of 
Theorem 3, pointwise Wald 95% confidence intervals can be constructed with 9h{a) ± lAQd/^/n, where cf^ is 
the (1,1) element of the sandwich variance estimate Fn{T>ha{'^)^‘^} based on the estimated efficient influence 
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function for (3h{a) given by 
‘^ha(Z) = 


g,ha{A)Kha{A)^i{Z]n,fL) - $h{o)] 

+ f gha{t)Khait)\^K'^,t) - rh{t)'^w{t) dt 


for Dfta = Fn{gha{A)Kha{A)gU, m{t) = P„{/i(L,f)}, w{t) = Fn{TT{t \ L)}. 


3.5. Data-driven bandwidth seiection 


The choice of smoothing parameter is critical for any nonparametric method; too much smoothing yields large 
biases and too little yields excessive variance. In this subsection we discuss how to use cross-validation to 
choose the relevant bandwidth parameter h. In general the method we propose parallels those used in standard 
nonparametric regression settings, and can give similar optimality properties. 

We can exploit the fact that our method can be cast as a non-standard nonparametric regression problem, 
and borrow from the wealth of literature on bandwidth selection there. Specifically, the logic behind Theorem 
3 (i.e., that nuisance function estimation can be asymptotically irrelevant) can be adapted to the bandwidth 
selection setting, by treating the pseudo-outcome ^(Z; tt, /i) as known and using for example the bandwidth 
selection framework from |Hardle et al.| (|1988|l. These authors proposed a unified selecfion approach fhaf 


includes generalized cross-validafion, Akaike’s informalion criferion, and leave-one-ouf cross-validafion as 
special cases, and showed fhe asympfolic equivalence and opfimalify of such approaches. In our selling, leave- 
one-ouf cross-validafion is alfraclive because of ils compulafional ease. The simplesl analog of leave-one-ouf 
cross-validafion for our problem would be lo selecl fhe oplimal bandwidfh from some sel T-L wilh 


n 


hopt = arg min 
h&H 


f TT, /i) — Oh{Ai 
1 - WhiA^) 


where = {l,0)Fn{ghaY^)^haA^)ShaA^y} ^A:(0) is fhe i*'" diagonal of fhe so-called 

smoolhing or hal malrix. The properties of this approach can be derived using logic similar to that in Theorem 
3, e.g., by adapting results from Li and Racine ( |2004| ). Alternatively one could split the sample, estimate the 
nuisance functions in one half, and then do leave-one-out cross-validation in the other half, treating the pseudo¬ 
outcomes estimated in the other half as known. We expect these approaches to be asymptotically equivalent to 
an oracle selector. 


An alternative option would be to use the fe-fold cross-validation approach of van der Laan and Dudoit 


(20031 or Dfaz and van der Laan (20131. This would entail randomly splitting the data into k parts, estimating 
the nuisance functions and the effect curve on {k — 1) training folds, using these estimates to compute measures 
of risk on the test fold, and then repeating across all k folds and averaging the risk estimates. One would 
then repeat this process for each bandwidth value h in some set T-L, and pick that which minimized the estimated 


cross-validated risk, van der Laan and Dudoit (20031 gave finite-sample and asymptotic results showing that 


the resulting estimator behaves similarly to an oracle estimator that minimizes the true unknown cross-validated 
risk. Unfortunately this cross-validation process can be more computationally intensive than the above leave- 
one-out method, especially if the nuisance functions are estimated with flexible compulalion-heavy mefhods. 
However fhis approach will be crucial when incorporafing general machine learning and moving beyond linear 
kernel smoofhers. 
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4. Simulation study 


We used simulation to examine the finite-sample properties of our proposed methods. Specifically we simulated 
from a model wifh normally disfrihufed covariafes 

L = (Li,...,L4r ~A^(0,l4), 


Bela disfrihufed exposure 


(A/20) I L ~ Befa{A(L), 1 - A(L)}, 
logil A(L) = -0.8 -h O.lLi -h O.IL 2 - O.IL 3 -h O. 2 L 4 , 


and a binary oulcome 


F I L, A ~ Bernoulli{/i(L, A)}, 

logif /i(L, A) = 1 + (0.2, 0.2,0.3, -0.1)L + A(0.1 - O.lLi + O.IL 3 - 0.13^A^). 

The above selup roughly mafches fhe dafa example from fhe nexl seclion. Figure shows a plof of Ihe effecl 
curve 9{a) = E{/r(L, a)} induced by fhe simulation setup, along with treatment versus outcome data for one 
simulated dataset (with n = 1000). 

To analyze the simulated data we used three different estimators: a marginalized regression (plug-in) esti¬ 
mator m(a) = P„{/i(L, a)}, and two different versions of the proposed local linear kernel estimator. Specif¬ 
ically we used an inverse-probability-weighted approach first developed by [Rubin and van der Laan ( 2006b[ ), 
which relies solely on a treatment model estimator tt (equivalent to setting /t = 0 ), and the standard doubly 
robust version that used both estimators tt and ji. To model the conditional treatment density tt we used logistic 
regression to estimate the parameters of the mean function A(l); we separately considered correctly specifying 
this mean function, and then also misspecifying the mean function by transforming the covariates with the 


same covariate transformations as in Kang and Schafer (20071. To estimate the outcome model /r we again 
used logistic regression, considering a correctly specified model and fhen a misspecified model fhaf used fhe 
same Iransformed covariafes as wifh tt and also lefl ouf fhe cubic term in a (buf kepi all ofher inferaclions). 
To selecl fhe bandwidlh we used fhe leave-one-oul approach proposed in Seclion 3.5, which Ireals fhe pseudo- 
oulcomes as known. For comparison we also considered an oracle approach lhal picked fhe bandwidlh by 
minimizing fhe oracle risk P„[{0(A) — 0/i(A)}^]. In bolh cases we found fhe minimum bandwidlh value in fhe 
range 77 = [0.01, 50] using numerical oplimizalion. 

We generated 500 simulated dafasels for each of fhree sample sizes, n = 100, 1000, and 10000. To assess 
fhe qualify of fhe eslimales across simulalions we calculated empirical bias and roof mean squared error al each 
poinl, and inlegraled across fhe function wifh respecl lo fhe densily of A. Specifically, telling Og (a) denole fhe 
estimated curve al poinl a in simulation s {s = 1,S wifh S = 500), we estimated fhe integrated absolute 
mean bias and roof mean squared error wifh 


Bias = 


f I ^ . 

/ -y^6s{a) - 9{a) w{a) da, 

Ja* ^ ^ 


RMSE = 


’ S=1 

S 1 1/2 

I 2 




1 


s 


- 9(a))" 


S=1 


w{a) da. 


In fhe above A* denotes a Irimmed version of fhe supporl of A, excluding 10% of mass al fhe boundaries. Note 
lhal Ihe above integrands (excepl for Ihe densily) correspond lo Ihe usual definitions of absolute mean bias and 
rool mean squared error for estimation of a single scalar parameter (e.g., Ihe curve al a single poinl). 

The simulation resulls are given in Table 1 (bolh fhe integrated bias and rool mean squared error are mul¬ 
tiplied by 100 for easier inlerprelalion). Estimators wilh slars (e.g., IPW*) denote Ihose wilh bandwidlhs 
selected using Ihe oracle risk. When bolh vr and jl were misspecified, all estimators gave subslanlial integrated 
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Treatment A 


Fig. 2. Plot of effect curve induced by simulation setup, with treatment and outcome data from one simulated 
dataset with n = 1000. 


bias and mean squared error (although the doubly robust estimator consistently performed better than the other 
estimators in this setting). Similarly, all estimators had relatively large mean squared error in the small sample 
size setting (n = 100) due to lack of precision, although differences in bias were similar to those at moderate 
and small sample sizes (n = 1000,10000). Specifically the regression estimator gave small bias when fi was 
correct and large bias when fi was misspecified, while the inverse-probability-weighted estimator gave small 
bias when tt was correct and large bias when yf was misspecified. However, the doubly robust estimator gave 
small bias as long as either tt ox fi was correctly specified, even if one was misspecified. 

The inverse-probability-weighted estimator was least precise, although it had smaller mean squared error 
than the misspecified regression estimator for moderate and large sample sizes. The doubly robust estimator 
was roughly similar to the inverse-probability-weighted estimator when the treatment model was correct, but 
gave less bias and was more precise, and was similar to the regression estimator when the outcome model 
was correct (but slightly more biased and less precise). In general the estimators based on the oracle-selected 
bandwidth were similar to those using the simple leave-one-out approach, but gave marginally less bias and 
mean squared error for small and moderate sample sizes. The benefits of the oracle bandwidth were relatively 
diminished with larger sample sizes. 
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Table 1. Integrated bias and root mean squared error (500 simulations 


n 

Method 

Bias (RMSE) when correct model is: 
Neither Treatment Outcome 

Both 

100 

Reg 

2.67 

(5.54) 

2.67 

(5.54) 

0.62 

(5.25) 

0.62 

(5.25) 


IPW 

2.26 

(8.49) 

1.64 

(8.57) 

2.26 

(8.49) 

1.64 

(8.57) 


IPW* 

2.26 

(7.36) 

1.58 

(7.37) 

2.26 

(7.36) 

1.58 

(7.37) 


DR 

2.23 

(6.27) 

1.01 

(6.28) 

1.12 

(5.92) 

1.10 

(6.50) 


DR* 

2.12 

(5.48) 

1.00 

(5.36) 

1.03 

(5.08) 

1.02 

(5.65) 

1000 

Reg 

2.62 

(3.07) 

2.62 

(3.07) 

0.06 

(1.53) 

0.06 

(1.53) 


IPW 

2.38 

(3.97) 

0.86 

(2.94) 

2.38 

(3.97) 

0.86 

(2.94) 


IPW* 

2.11 

(3.44) 

0.70 

(2.34) 

2.11 

(3.44) 

0.70 

(2.34) 


DR 

2.03 

(3.11) 

0.75 

(2.39) 

0.74 

(2.53) 

0.68 

(2.25) 


DR* 

1.84 

(2.67) 

0.64 

(1.88) 

0.61 

(1.78) 

0.58 

(1.78) 

10000 

Reg 

2.65 

(2.70) 

2.65 

(2.70) 

0.02 

(0.47) 

0.02 

(0.47) 


IPW 

2.36 

(3.42) 

0.33 

(1.09) 

2.36 

(3.42) 

0.33 

(1.09) 


IPW* 

2.24 

(3.28) 

0.35 

(0.85) 

2.24 

(3.28) 

0.35 

(0.85) 


DR 

1.81 

(2.35) 

0.26 

(0.86) 

0.20 

(1.21) 

0.25 

(0.78) 


DR* 

1.76 

(2.27) 

0.31 

(0.68) 

0.24 

(1.10) 

0.29 

(0.64) 

Notes: 

Bias / RMSE = integrated mean bias / root mean squared 
error; IPW = inverse probability weighted; Reg = regression; 

DR = doubly robust; * = uses oracle bandwidth. 


5. Application 


In this section we apply the proposed methodology to estimate the effect of nurse staffing on hospital readmis¬ 
sions penalties, as discussed in the Introduction. In the original paper, McHugh et al. ( 2013| ) used a matching 
approach to control for hospital differences, and found that hospitals with more nurse staffing were less likely 
to be penalized; this suggests increasing nurse staffing to help curb excess readmissions. However, their anal¬ 
ysis only considered the effect of higher nurse staffing versus lower nurse staffing, and did not explore the full 
effect curve; it also relied solely on matching for covariate adjustment, i.e., was not doubly robust. 

In this analysis we use the proposed kernel smoothing approach to estimate the full effect curve flexibly, 
while also allowing for doubly robust covariate adjustment. We use the same data on 2976 acute care hospitals 
as in |McHugh et al.| (|2013[); full details are given in the original paper. The covariates L include hospital 


size, teaching intensity, not-for-profit status, urban versus rural location, patient race proportions, proportion of 
patients on Medicaid, average socioeconomic status, operating margins, a measure of market competition, and 
whether open heart or organ transplant surgery is performed. The treatment A is nurse staffing hours, measured 
as the ratio of registered nurse hours to adjusted patient days, and the outcome Y indicates whether the hospital 
was penalized due to excess readmissions. Excess readmissions are calculated by the Centers for Medicare 
& Medicaid Services and aim to adjust for the fact that different hospitals see different patient populations. 
Without unmeasured confounding, the quantity 9{a) represents the proportion of hospitals that would have 
been penalized had all hospitals changed their nurse staffing hours to level a. Otherwise 9{a) can be viewed as 
an adjusted measure of the relationship between nurse staffing and readmissions penalties. 

For the conditional density 7r(a | 1) we assumed a model A = A(L) -|- 7(L)e, where e has mean zero 
and unit variance given the covariates, but otherwise has an unspecified density. We flexibly estimated the 
conditional mean function A(l) = E(A | L = 1) and variance function 7(1) = \?ix{A | L = 1) by combining 
linear regression, multivariate adaptive regression splines, generalized additive models. Lasso, and boosting, 
using the cross-validation-based Super Learner algorithm ( [van der Laan et al. 20071, in order to reduce chances 
of model misspecification. A standard kernel approach was used to estimate the density of e. 

For the outcome regression /r(l, a) we again used the Super Learner approach, combining logistic regres¬ 
sion, multivariate adaptive regression splines, generalized additive models. Lasso, and boosting. To select 























Continuous treatment effects 13 


the bandwidth parameter h we used the leave-one-out approach discussed in Section 3.5. We considered re¬ 
gression, inverse-probability-weighted, and doubly robust estimators, as in the simulation study in Section 4. 
The two hospitals (<0.1%) with smallest inverse-probability weights were removed as outliers. For the dou¬ 
bly robust estimator we also computed pointwise uncertainty intervals (i.e., confidence intervals around the 
smoothed parameter 0*y^[a)', see Section 3.4) using a Wald approach based on the empirical variance of the 
estimating function values. 



Average nursing hours per patient day 


Fig. 3. Estimated effects of nurse staffing on readmissions penalties. 

A plot showing the results from the three estimators (with uncertainty intervals for the proposed doubly 
robust estimator) is given in Figure]^ In general the three estimators were very similar. For less than 5 average 
nurse staffing hours the adjusted chance of penalization was estimated to be roughly constant, around 73%, but 
at 5 hours chances of penalization began decreasing, reaching approximately 60% when nurse staffing reached 
11 hours. This suggests that adding nurse staffing hours may be particularly beneficial in the 5-10 hour range, 
in terms of reducing risk of readmissions penalization; most hospitals (65%) lie in this range in our data. 

6. Discussion 

In this paper we developed a novel approach for estimating the average effect of a continuous treatment; im¬ 
portantly the approach allows for flexible doubly robust covariate adjustment without requiring any parametric 
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assumptions about the form of the effect curve, and can incorporate general machine learning and nonparamet- 
ric regression methods. We presented a novel efficient influence function for a stochastic intervention parameter 
defined within a nonparametric model; this influence function motivated the proposed approach, but may also 
be useful to estimate on its own. In addition we provided asymptotic results (including rates of convergence 
and asymptotic normality) for a particular kernel estimation version of our method, which only require the 
effect curve to be twice continuously differentiable, and allows for flexible data-adaptive estimation of nui¬ 
sance functions. These results show the double robustness of the proposed approach, since either a conditional 
treatment density or outcome regression model can be misspecified and the proposed estimator will still be 
consistent as long as one such nuisance function is correctly specified. We also showed how double robustness 
can result in smaller second-order bias even when both nuisance functions are consistently estimated. Finally, 
we proposed a simple and fast data-driven cross-validation approach for bandwidth selection, found favorable 
finite sample properties of our proposed approach in a simulation study, and applied the kernel estimator to 
examine the effects of hospital nurse staffing on excess readmissions penalty. 

This paper integrates semiparametric (doubly robust) causal inference with nonparametric function esti¬ 
mation and machine learning, helping to bridge the “huge gap between classical semiparametric models and 
the model in which nothing is assumed” ( |van der Vaart[ 2014| ). In particular our work extends standard non¬ 
parametric regression by allowing for complex covariate adjustment and doubly robust estimation, and extends 
standard doubly robust causal inference methods by allowing for nonparametric smoothing. Many interest¬ 
ing problems arise in this gap between standard nonparametric and semiparametric inference, leading to many 
opportunities for important future work, especially for complex non-regular target parameters that are not path- 
wise differentiable. In the context of this paper, in future work it will be useful to study uniform distributional 
properties of our proposed estimator (e.g., weak convergence), as well as its role in testing and inference (e.g., 
for constructing tests that have power to detect a wide array of deviations from the null hypothesis of no effect 
of a continuous treatment). 
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1. Guide to notation 

Z = (L,yl,y) = observed data arising from distribution P with density p{z) = p{y \ l,a)p{a \ l)p(l) and 
support supp(Z) = Z = CxAxy 

IPn = ^ E* = empirical measure so that P„(/) = P„{/(Z)} = ^ Ei /(^i) 

P(/) = P{/(Z)} = dP{z) - expectation for new Z treating / as fixed (so P(/) is random if / de¬ 

pends on sample, in which case P(/) / IE(/)) 

7 r(a I 1) = p{a \ 1) = ^P{A < a | 1) = conditional density of treatment A 

7r(a I 1) = user-specified estimator of 7r(a | 1), which converges to limit 7f(a | 1) that may not equal true tt 
w{a) = p{a) = ^P{A < a) = E{7r(a | L)} = 7r(a | 1) dP(l) - density of A 

w(a) = Pn{^(a I L)} = fc^i^ I 1) ^ Ej^(® I 1*) - estimator of zu, which converges to limit 

m(a) that may not equal true m 

//(I, a) = E(y I L = 1, A = a) = Jyy dP{y | 1, a) = conditional mean outcome 

/i(l, a) - user-specified estimator of /i(l, a), which converges to limit /i(l, a) that may not equal true p 

m{a) = Pn{/r(L,a)} = /i(l, a) dPn(l) = ^ Ej - r^gr^s^iori'hased plug-in estimator of 6*(a), 

which converges to limit m{a) that may not equal true 9 

2. Proof of Theorem 1 

Let p(z; e) be a parametric submodel with parameter e G M and p(z; 0) = p{z), for example p(z; e) = {1 -|- 
e6(z)}p(z) where E{6(Z)} = 0 with |6(Z)| < B and |e| < (1/-B) to ensure thatp(z;e) > 0. For notational 
simplicity we denote {df{t; e)/de}\e=o by /^(t; 0) for any general function / of e and other arguments t. 

By definition the efficient influence function for is the unique function </>(Z) that satisfies V’e(O) = 
E{(/)(Z)£'(Z; 0)}, where ^/)(e) represents the parameter of interest as a functional on the parametric submodel 
and ^(w I w; e) = logp(w | w; e) for any partition (W, W) C Z. Therefore 

4(z; e) = 4(y I h a ;«) + 4(a 11; e) + 4(1; 4- 
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We give two important properties of such score functions | w; e) that will he used throughout this 
proof. First note that since £(w | w; e) is a log transformation of p(w | w; e), it follows that | w; e) = 
p'g(w I w; e)/p(w I w; e) because for general functions / we have d\og f{e)/de = {df{e)/de}/f{e). Simi¬ 
larly, as with any score function, note that Ejf'' (W | W; 0) | W} = 0 since 

f /g(w I w;0) dP{'w I w) = / dP^{w I w) = — f dP{w | w) = 0. 

Jw Jw Jw 

Our goal in this proof is to show that V'e(O) = IE{(/>(Z)£g(Z; 0)} for the proposed influence function 4>{'Zi) = 
^(Z; 7r,iJ,) — ip + J^{;u(L, a) —/r(l, a)dP{l)}w{a)da given in the main text. First we will give an expression 
for V'e(O). By definition V'(e) = 9{a; e)'cu{a; e) da, so 

V'e(O) = [ O)^(q) + 0)} 0) + 0)}- 

Ja 

Also since 6{a; e) = f^fyV p{y \ 1, a; e)p(l; e) dri{y) dn{l), we have 


Therefore 


6 i'(a; 0 ) = J^J^y{peiy I i>«; 0 )p(i) +piy I 

= I l,a;0)p(y I l,a)p(l) +p{y I l,a)<(l;0)p(l)|d7?(y) ciz/(l) 

= E E{Ye[{Y I L,A;0) | L,A = a} + e|^(L, a)<(L;0)}. 

y (0) = ^ [E{y/,(y I L, A; 0) I L, A = a} 

+ e|;u(L, a)<(L; 0)} + 0(a)<(a; 0)^ ti7(a) da. 


Now we will consider the covariance 

E{</)(ZK(Z; 0)} = E[</.(Z){4(y I L, A; 0) + 4(A, L; 0)}], 

which we need to show equals the earlier expression for V’e(O)- 

Recall the proposed efficient influence function given in the main text is 

- ^ ti7(A) + m(A) — Ip + [ |/r(L, a) — m(a)]-ti7(a) da 

7r(A I Lj Ja^ ^ 

where we define 

m{a) = j /u(l, a) dP{l) 

as the marginalized version of the regression function q, so that m{a) = 9{a) if p is the true regression 
function. 

Thus E{(/>(Z)£'(y I L, A;0)} equals 

7 r(A I lI/ct-^A) “^(«)}^(«) da + 9{A) - £'^{Y \ L,A;0)^ 

_ [ y4(y I L, A; 0) ) _ r E{y4(y | l, A; o) | l, a} 1 
\ 7r(A I L)/tz7(A) J [ 7r(A|L)/ro(A) 
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= / E 

Ja 


E{Y£'^(Y I L, A;0) I L,yl = a} w(a) da 


where the first equality follows since E{l'^{Y \ L, A; 0) | L, A} = 0, the second hy iterated expectation condi¬ 
tioning on L and A, and the third hy iterated expectation conditioning on L. Now note that E{(l){7i)l'^{A, L; 0)} 
equals 


E 


Y-^x{l.,A) 

[l7r(yl I Y)/w{A) 


4(A, L; 0) + {OiA) - V^}{4(L | ^; 0) + 4(A; 0)} 




J |^(L, a) — 9{a)'^w{a) da | L; 0) -|- 0)| 


= E 


6{A)l'^{A-,f)) + / |u(L, a)/j,(L; 0)tJ7(a) da 


lA 


since hy definition i'^{A,'L-,0) = i'^{A \ L;0) -|- £g(L;0) = /^(L | yl; 0) -|- f^{A;0), and the equality used 
iterated expectation conditioning on L and A for the first term in the first line, A for the second term in the first 
line, and L for the second line. Adding the expressions E{(f){7j)l'^{Y \ L, A; 0)} and E{i;/)(Z)f''(A, L; 0)} gives 


lA 


E 


E{Y£[{Y I L,A;0) I L,A = a} + ^(L, a)4(L; 0) +d(a)<(a;0) ^ 7 ( 0 ) da, 


which equals V’e(O)- Thus (j) is the efficient influence function. 


3. Double robustness of efficient influence function & mapping 

Here we will show that E{(/)(Z; W, JI, ip)} = 0 if either If = n or JI = ^, where (/>(Z; W, JI, ip) is the influence 
function defined as in the main text as 


C(Z; TT, //) - V’ + ^ |m(L, ^ I 1) dP{\) da, 

where 

[ 7f{A I 1) dP(l) + [ 71 ( 1 , A) dP(l). 

I L) Jc Jc 

First note that, letting w{a) = E{ff (a | L)} and m(a) = E{7Z(L, a)}, we have 


f -7i(l,a) ^ / X 

— _ dF[l \ a) + m[a) 


A = 


Jc I l)/^(«) 

= {d(l,a) 

= G{a) + a) - Ji{\, a)| 


7r(a I l)/t77(a) , 

7r(a I lj/t77(aj 

7r(a I l)/ro(a) 


7r(a I l)/ro(a) 


- 1 dP{l) 


where the first equality follows by iterated expectation, the second follows since p(l | a) = p{a \ \)p{\)/p{a), 
and the third by rearranging. The last line shows that E{^(Z; yf, /u) | A = a} = 9{a) as long as either yf = tt or 
JL = p, since in either case the remainder is zero. 

Therefore if ^ = yr or 7 Z = we have 


lA 


E{,f (Z; yr, /i) I A = a}ty7(a) da — ip = / 9{a)w{a) da — ip = f) 


lA 
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so that 


E{0(Z;7r,/x,'i/')} = E 


J |/x(L, a) — m(a)|ti 7 (a) da 


But 


E J ^^(L,a) — m{a)'^'uj{a) da = J |m(a) — m(o)|tu(a) da = 0 


by definition. 

Therefore E{(/)(Z; 7f,]I,Tp)} = 0 if either n = tt or JI = ^. 


4. TMLE version of estimator 

As we note in the main text, the proposed estimator 

dh{a) = gha{aV^n{gha{A)Kha{^)gha{Ay}~^^n{ghaKha{A)i{Z-,^,fl)} 

is not guaranteed to respect bounds on Y, e.g., if 1" G [0,1] is binary. If some observations have very small 
values of the denominator quantity Tr{A \ 'L)/w{A) then the estimator could be unstable and may take values 
outside the range of Y. Targeted maximum likelihood or minimum loss-based estimators (TMLEs), developed 


by van der Laan and Rubin (2006 1 , help combat this problem (see discussion for example in van der Laan 


and Rose (2011 1 and elsewhere). In this section we present a TMLE that should give better finite-sample 


performance, for example, when there are near-violations of the positivity assumption. 

Our proposed TMLE can be fit as follows. Eirst estimate the nuisance functions tt and jl, for example with 
flexible machine learning (e.g.. Super Learner). Then fit a logistic regression model regressing Y on ‘clever 
covariate’ vector 

A IT /t^ gha{A)Kha{A) 

with logit{/i(L, A)} included as an offset (and no intercept term). This ensures 

gha{A)Kha{A) 


Y — expit 


logit{/i(L, A)} -h e^c/ia(L, A) 


= 0 


^ 7 r(A I L)/ro(A) 

where e = (ei, 62 ) are the parameters in the logistic regression fit. Now define 

AL(L, A) = expit logit{/i(L, A)} -f e^Cha{^, A) 

Then the proposed method proceeds as before, simply replacing predicted values /i(L, A) with /i^^(L, A). 
Specifically we esfimate 9 {a) with 

9l{a) = gha{aY'^n{gha{A)Kha{A)gha{AY]~^'S‘n{ghaKha{A)rhl^{A)], 

'mlaii) = IPn{AL(L,f)} = En (expit logit{/i(L,f)} -f i'^Cha{'L,t) 

The above TMLE is somewhat more complicated to fit than the estimator proposed in the main text. An alterna¬ 
tive approach that would also respect bounds on Y would be to estimate d{a) with 6h{a) = expit{g/ia(a)^/3;j(o)} 
where 


Ph{a) = argmin 
/3eR= 


Kha{A) ^(Z; 7 r,/i) - expit{g/ia(A)^/3} 


1 2 


Another simple option would be to use the original estimator from the main text and project onto the range of 
possible Y values. 
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5. Stochastic equicontinuity lemmas 

In this section we discuss the concept of asymptotic or stochastic equicontinuity, and give two lemmas that 
play a central role in subsequent proofs. 

Let Gn = ^/n{¥n — IP). A sequence of empirical processes {GnVn{f) : / G fF} indexed hy elements / 
ranging over a metric space T (equipped with semimetric p) is stochastically equicontinuous (|Pollard| [1984j 
Andrews 1994 van der Vaart and Wellner| 1996 1 if for every e > 0 and C > 0 there exists a <5 > 0 such that 


n 


> 


1 | is 


1984 


limsupp^ sup \GnVn{fl) - GnVn{f2)\ > 6^ < C- 
n->-oo Vp(/i,/2)<5 / 

An important consequence of stochastic equicontinuity for our purposes is that if {G,il4(- 
stochastically equicontinuous then p(/, /) = Op(l) implies that Gn{Vn{f) — Vn{f)} = Op(l) (Pollard 
|Andrewsl|1994| |. 

Before presenting relevant lemmas, we first need to introduce some notation. Let F denote an envelope 
function for the space F, i.e., a function with P(z) > |/(z)| for every f £ F and z £ Z. Also let A^(e, P, 11 • ||) 
denote the covering number, i.e., the minimal number of e-balls (using distance || • ||) needed to cover F, and 
let ^ 

J{ 5 ,F,L 2 ) = sup ^logN{e\\F\\Q,2,F,L2m de, 

where L 2 (( 5 ) denotes the usual L 2 semimetric under distribution Q, which for any / is ||/| |( 3_2 = (/ f'^dQY^'^. 
We call J(cx), F, L2) the uniform entropy integral. 

To show that a sequence of processes {G„14(-) : n > 1} as defined above is stochastically equicontinuous. 


one can use Theorem 2.11.1 from van der Vaart and Wellner (1996l. (Note that in their notation Z^if) = 


{\/y/n)Vn{f)■) Specifically, Theorem 2.11.1 states that stochastic equicontinuity follows from the following 
two Lindeberg conditions (conditions 1 and 2), with an additional restriction on the complexity of the space F 
(condition 3): 


(1) E{||14||3r /(IlKlb > ^V^)} Ofor every e > 0. 

(2) supp(j^ E[{14(/i) - 14 (/ 2 )}^] 0 for every sequence 5n 0. 

(3) fg" ■\/logN(e, F, L 2 (IPn)) de A 0 for every sequence 5n —^ 0. 

We will give conditions under which two particular kinds of sequences of empirical processes are stochas¬ 
tically equicontinuous. Specifically we consider processes {Gnfn(-) : n > 1} where 

Vnif) = Vh gha{A)Kha{A)f{Z), 

Vnif) = j f{'L,t)gha{t)Kha{t) dt, 


wifh gha{t) and Kfia{t) defined earlier (nofe I4 depends on n since h = hn does). 
Lemma 1. Consider the sequence of processes {Gnfn,j(') : n > 1} with 

uAf)=^ 

where f £ F with envelope F{z) = sup |/(z)|. Assume the following.■ 

(a) The bandwidth h = hn satisfies /i —?■ 0 and nhf —)• 00 n —>• 00. 

(b) The kernel K is a bounded symmetric probability density with support [—1,1]. 

(c) A has compact support A and continuous density w. 

(d) The envelope F is uniformly bounded, i.e., ||P ||2 < fmax < 00. 
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(e) T has a finite uniform entropy integral, i.e., L 2 ) < 00 . 


Then {GnKij(-) : n > 1} is stochastically equicontinuous. 


Proof. Recall that to show stochastic equicontinuity we can check conditions (l)-(3) of Theorem 2.11.1 from 


van der Vaart and Wellner (19961, as given earlier. 


P 

We will show Lindeherg condition (1) using the dominated convergence theorem, which says if Xn — X 

and \Xn\ < Y with E(y) < 00 then E(X„,) —)■ E(X). First note that ||yn,j||jr .f(||14,,j||> s^/n) = Op(l) 
since for any <5 > 0 


Yim^p[\\Vnfi\^r PWynaWT > eM > < 5 } 

< lim p(||14j||.f > 

n^oo \ / 

= Im P|(2l - afi-^K > eVn/i2i-i| 

< J[i^p|(A - ay~^\\K\\\^_i^^fmax > 


The last line above used the kernel and envelope conditions (h) and (c). The expression in the last line tends 
to zero as n 00 , since re/i —)> 00 and nhfi —)• 00 hy the bandwidth condition (a) (note that n/i —00 is 
implied by the fact that /i —)• 0 and nh^ 00 ), and since A has compact support by condition (c). We also 
have ||yn,j|l3^-^(||k"n,j||j' > £:^/n) < since /(•) is the indicator function, and E{|| 14 j||^} < cx) since 



The second line above follows by the distribution condition (c) and the envelope condition (d), and the last line 
is finite by the kernel properties assumed in condition (b). Therefore since ||14,j|l3^ = 

Op(l) and || 14 ,j|l 3 ^ > e^/n) < 1114 ,j||j- with E{|| 14 j|| 3 r} < 00 , the dominated convergence 

theorem implies that E{||14j| |jr F(| |fn,j| |.f > —)• 0 as n —)■ 00 and thus Lindeherg condition (1) holds. 

Lindeherg condition (2) holds when p{-) is the uniform norm since 


sup E[{Kj(/i)-14 ,,(/2)}2] 
p(/l>/2)<y 


= sup E 

\\fl—f2\\z<S„ 


<K 


t — a 


A — a 
h 

2 (i-i) 1 


2(i-i) 




h 


t — a 


K 


dt —> 0 


w{f) dt 

, for any 6n —> 0. 


2 


The first equality above follows by definition, the second inequality by the fact that ||/i — / 2 II .2 < 5^, and the 
third by condition (c) and a change of variables. The last line tends to zero as <5^ —)■ 0 by the kernel properties 
in condition (b). 


Now we consider the complexity condition (3). As described in Section 2.11.1.1 (page 209) of van der 
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Vaart and Wellner (19961, a process (1/ \/n)Vn{f) is measure-like if for some (random) measure Uni we have 
^|k(/i) - K(/ 2 )} < Jifi- hf duni , for every fi, f 2 G J'. 


van der Vaart and Wellner (19961 show in their Lemma 2.11.6 that if fF has a finite uniform entropy integral, 


then measure-like processes indexed hy F satisfy the complexity condition (3) of Theorem 2.11.1. 
Note that for our process Vnj{f) of interest, we have 


l{VnAh) - K,i(/2)}' = {/l(Z) - /2 (z)}'v4 Ik 


A-^ 

~h~ 


Therefore the processes Vn,j{f) are measure-like for the random measure Uni = VhghaKha^Zi, where 5zi 
denotes the Dirac measure. Hence, hy Lemma 2.11.6 of van der Vaart and Wellner] ( |1996| l, the fact that F has 
a finite uniform entropy integral (assumed in condition (e)) implies that complexity condition (3) is satisfied. 
Therefore fhe sequence {Gnl4,j (-) : n > 1} is stochastically equiconfinuous. 

□ 


As mentioned earlier. Lemma 1 implies fhaf if \ \f — f\\z = Op{l) fhen 


Anh{¥n — P) 


A — a 




K 


A — a 
h 


{/(Z)-/(Z)} 


= Op(l)- 


Lemma 2. Consider the sequence of processes {GnKi,j(') : n. > 1} with 

where f € F with envelope F as in Lemma 1. Assume conditions (b), (d), and (e) of Lemma 1 hold. Then 
{GnVnji') : n > 1} is stochastically equicontinuous. 

Proof The proof of Lemma 2 is very similar to fhaf of Lemma 1. We again show Lindeherg condition (1) using 
fhe dominated convergence fheorem. Firsf note ||ln,j||jr F(||V'„j||jr > e^/n) = Op(l) since for any <5 > 0 

lim p{\\VnA^jrI{\\Vnq\\T> e^/n)>^'\ < lim > e\/n) 

n^oo L J n^oo \ / 

= lim p\ f F(L,t){{t — a)/hy~^K{{t — a)/h}/h dt > e\/n\ 
n^oo [. J J 

< lim lifmax [ \u\K^K{u) dt > ey/n] = 0. 
n—^oo L / J 


The lasf line above used fhe kernel and envelope conditions (b) and (d). We also have ||14,j||jr/(||lnj||j- > 

e^/n) < ||14,j||jr and E{||14,j|l3^} equals 


F{L,t) 


t — a 




-K 


t — a 


dt} </, 


2 

max 


uY ^K{u) du 


which is finite again using conditions (b) and (d). Therefore Lindeherg condition (1) holds sinceE{| |14j ||^/(||14j| |_ 7 r > 
Sy/n)} —)• 0 by dominated convergence. 

Lindeherg condifion (2) holds wifh fhe uniform norm since, by definition and using fhe kernel condifion 
(b), supp(j^ E[{K(/i) - K(/ 2 )}^] equals 


sup E 

ll/l —/2|L<5„ 


J {/i(L,f)-/2(L,t)} 


t — a\^ ^1 




t — a 
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< 


Sl[J\ur^K(u)du} ^ 0 , 


for any d„ 


As in Lemma 1, we use that Vnj is measure-like to check condition (3). Here 


-{VnAfl)-ynAh)}"=^ 

n n 


{/i(L,f)-/ 2 (L,f)} 


t — axf-i 1 — a 


."'Lit'''* 


1 2 


< 


1 


n 


/i(L,f)-/ 2 (L,f)y 


t — a 
h 


2(j-l) 


1 — a 




hy Jensen’s inequality. Therefore the processes Vn,j{f) are measure-like, and the fact that F has a finite 
uniform entropy integral (assumed in condition (e)) implies that complexity condition (3) is satisfied. This 
concludes fhe proof. □ 


6. Proof of Theorem 2 


Here we let 9h{a) = gha(“)^DftaIPn{gha(^)-fi"ha(^)?(Z; vf,/l)} denote the infeasihle estimator one would 
use if the nuisance functions were known, with Tiha = '^n{^haA)Kha{-^)^ha{^Y} as in the main text. Our 
proposed estimator is dh{a) = g/ia(«)^DftaIPn{g/ia(^)-K^ha(^)^(Z; vr, /i)}. We use the decomposition 

9h{a) - 9{a) = ^9h{a) - 0 ( 0)1 -f ^Oh{a) - 0/j(o)| = |0h(a) - 0(a)| -f {Rn,i + Rn, 2 ) 


where 


ttS-I/ 


Rn,l = g/xa(a)^D 


ha « 


g/i, 


,{A)KhaiA){i{Z-,n,fi)-A^-F,R)} 


Rn,2 = g/ia(a)'^D JlP gha{A)Kha{A)\A'^-A,fl) “ 


Our proof is divided into three parts, one for the analysis of each of the terms above. 


6.1. Convergence rate of6h{a) - 9{a) 

Since the infeasihle estimator 0/i(a) is a standard local linear kernel estimator with outcome ^(Z; JZ) and 
regressor A, it can he analyzed with results from the local polynomial kernel regression literature. In particular, 
since our Assumption 2 (Positivity) a long with c onditions (h), (c), (d) of our Theorem 2 imply the bandwidth 
condition and conditions l(i)-l(iv) in Fan (19931, by their Theorem 1 we have E[0/j(o) — E{^(Z; vf, JZ) | A = 
a}]^ = 0{l/nh + hY- Further, condition (a) of our Theorem 1 implies E{^(Z;7f, /Z) | A = a} = 9{a) by the 
results in Section 3 of this Appendix. Therefore E{0/i(a) — 9{a)Y = 0{l/nh + /i^). 

Now let Xn = 9h{a) — 9{a). The above implies that, for some M* > 0, limsup^_^oo E{A'^/(l/n/i -|- 
A)} < AI*. Therefore for any e > 0, if M > M*/e. 


(. 


Xi 


lim P . 

n^oo Y 1 / nh + h* 


> M \ < lim sup -—E 


(. 




n^oo M yi/nh + R^ 


< M*/M < e 


where the first equality follows by Markov’s inequality, the second by the fact that E(X^) = 0{l/nh + A) 
and the third by definition of M. Since e > 0 was arbitrary this implies {0/i(a) — 9{a)Y = Op{l/nh + /i^). 
Now let = 1 /^/nh + h? and Cn = 1/nh + and note that 


P 


X„ 


>Vm]= p 


Xi 


Cn + 2h^AhJn 


> M \ < P 




> M 
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Taking limits as n —)■ oo implies that 


9h{a) - 0{a) 



6.2. Asymptotic negligibility of Rn,i 

Now we will show that 

ghaiA)Kha{A)^i{Z;Tr,fi) 

is asymptotically negligihle up to order \Anh, i.e., |-Rn,i| = Op{l/\AnJi). 

First we will show that = Op(l). Consider the elements of the matrix D/ia- Using the 

continuity of w from condition (d) of Theorem 2 in the main text, along with properties of the kernel function 
from condition (c), it is straightforward to show that 

E([Fn{Kha{A)} - w{a)f) = 0{h) + 0{l/nh). 

Hence E([P„{iF/ja(yl)} — ro(a)]^) = o(l), since /r —)■ 0 and n/i —)■ oo hy condition (h), and therefore 
Fn{Kha{A)} A w{a) hy Markov’s inequality. This is a standard result in classical kernel estimation problems. 
By the same logic we similarly have 


Rn,l=^ha{aYtill{Fn-F) 


Fn{KUA){A-a)/h} 4 0, 
'^n[Kha{A){{A- a)/hY] 4 w{a) J v?K{u)du. 


Therefore g/ia(a)'^D^2 ^ O) diag{ro(a), t:i7(a)j^2} ^ = {r^{a) ^ O) , where diag(ci,C 2 ) is a (2 x 2) 
diagonal matrix with elements ci and C 2 on the diagonal, V 2 = f u^K{u) du, and w{a) / 0 hecause of 
Assumption 2 (Positivity). Thus we have shown that g/ia(a)^D ha = (^(4 ^ 0) + Op(l) = Op{l). 

Now we will analyze the term 


ShaiA)KiiQ 


Z;7f,/i) -4Z;7r, 


which we will show is Op{l/^/nh). This is equivalent to showing 


Yhgha{A)Kha{A)i{Z)] =Gn [Yh ^ha{A)Kha{AY{Z)]] +Op{l), 


where we define 4Z) = fr, fl) and 4^) = 4^; ff, Jt)- Note that, as discussed in the previous section on 
stochastic equicontinuity, if 1 — ,^| |^ = Op(l) then the above result follows if the sequence of empirical pro¬ 
cesses {GnVn{-) : n > 1} is stochastically equicontinuous, where we define Un(C) = '/hSha{A)Kha{A)^{Z) 
with G H for some metric space H. Thus first we will show that ||.^—^| 1^ = sup^g,^ A) “4^5 ^4)1 = 

Op(l). Then we will check the conditions given in Lemma 1 of the previous section, which ensure that 
{GnUn(-) : n > 1} defined above is stochastically equicontinuous. 

First note that after some rearranging we can write 


= .. I ro(a) + m{a) - , ,, wia, 


Tt{a I 1) 
y-Jl{l,a) w{a) 
7f(a I 1) 7r(a | 1) 


7 r(a I 1) 


|7r(a I 1) - 7f(a | 1)| + 


w{a) 
Tt{a I 1) 


I — m(a) 

{ 71 ( 1 , 0 ) - 
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+ 


y - 

TT{a I 1) 


^w{a) — -07(0)1 + |?7i(o) — m(a)|. 


Therefore, letting ^(z) = ^(z; tt, (1) and similarly ^(z) = ^(z; vr, /x), by the uniform boundedness assumed in 
condition (e) and the triangle inequality we have 


11^ - ^lU = Op(^11^ -A\z + \\y-y\\z + \\'^-M\A + \\m- m\\A 

Therefore since Hvr — ^||^ = Op(l) and \ \jl — JL\\z = Op(l) by definition, and since Op{op{l)) = Op(l), the 
above implies 


III - = Op[\\uj -w\U + \\m- m\Uj +Op(l). 

Now, since by definition w{a) = Pn{7r(a | L)} and w{a) = E{7f(a | L)}, we have that 


1^ ~ '^\\a = sup |o7(a) — - 07 ( 0)1 = sup ¥n’^{a \ L) — P7r(a 
d^A d^A 


= sup 
d^A 

< sup 
d£A 


P„{7r(a I L) - 7r(o | L)} + 
P„{-7r(a I L) — Tf{a \ L)} 


- P)^(o I L) 


+ sup 

(En 

- P)w(a 1 L) 

deA 




< ||vr — -ttII^ + sup 
aeA 


— r vr o 


where the last two lines used the triangle inequality. By definition the first term on the right hand side of the 
last line is Op(l), and by the uniform entropy assumption in condition (e) the second term is also Op(l) since it 
implies that if is Glivenko-Cantelli ( van der Vaart[ 2000 van der Vaart and Wellnerj 19961. Therefore we have 
11'^ ~ ^1U = Op(l). By exactly the same logic, using definitions and condition (e) we similarly have 


|m — -mll^ < sup 

( 2 Gi/ 4 . 


Pn{/l(L,o) - /u(L,o)} 


+ sup 
d£A 


- P)/x(L,o) 


< HA - y\\z + sup 

d£A 


-P)/i(L,o) =Op(l). 


Therefore |||-||U = sup^g^ ||(z; tt, A) - |(z; tt, /x)| = Op(l). 

Now we will show that the conditions given in Lemma 1 hold, indicating that the sequence {G,il4(-) : 
n > 1} defined above is sfochastically equicontinuous. Conditions (a)-(c) of Lemma 1 are given exactly in 
the statement of Theorem 2 and so hold immediately. For conditions (d) and (e) of Lemma 1 we need to 
consider the space E containing elements |(z). The space E can be constructed as a transformation of the 
spaces (J>,-Pp,-PA j;-P m) containing the functions {tt, n,w,m), along with the single identity function that 
takes Z as input and outputs Y. Specifically, we have 

H = (y © © Tm 

where Y is shorthand for the single function that outputs Y from Z, and we define = {/i + /2 : fj £ 

J^j}, = {!//:/£ and similarly = {/ 1/2 : fj £ J^j}, for arbifrary funcfion classes and Tj 

confaining funcfions / and fj respectively. For more discussion of such consfrucfions of higher-level funcfion 
spaces based on lower-level building blocks, we refer fhe reader fo Pollard (19901 (Section 5), Andrews (19941 
(Section 4.1), van der Vaarf and Wellner (19961 (Section 2.10), and |van der Vaart (20001 (Examples 19.18- 
19.20); for use in a related example and more discussion see |van der Vaart and van der Laan| (|2006| (Section 
5). 

By condition (e) of Theorem 2, the classes (JA-,-TAj,-^ m) are uniformly bounded (i.e., their minimal 
envelopes are bounded above by some constant). Similarly the class is also uniformly bounded by the 
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second part of condition (e). Therefore the constructed class H is hounded as well, so that condition (d) of 
Lemma 1 holds. 


Condition (e) of Lemma 1 can he verified hy using permanence or stability properties of the uniform entropy 
integral ( Andrews'} 1994 van der Vaart and Wellner[ 1996(1 van der Vaart and van der Laan] 20061. Specifically, 
by condifion (e) of Theorem 2, fhe classes all have a finite un iform entropy in tegral (as does 

the single function Y, or any finite set of functions). Therefore by Theorem 3 of Andrews (19941, since is 
appropriately bounded with finite envelope, it follows that the class E also has a finite uniform entropy integral. 
Thus condition (e) of Lemma 1 holds. For results similar to Theorem 3 of|Andre^(|1994|), also see Theorem 


2.10.20 of van der Vaart and Wellner (19961, and Lemma 5.1 and subsequent examples of van der Vaart and 


van der Laan (20061. 


Thus since the conditions of Lemma 1 hold, the sequence {Gnl4(-) : n > 1} with Vn{^) = VhghaiA)KhaiA)^{Z) 
is stochastically equicontinuous, and since ||^ — ^||^ = sup 2 ,g_z |^(z;7r, /I) — = Op(l), it therefore 

follows that 

(P„-P) gha{A)Kha{A)^i{Z]-K, fi) - = Op{l/V^). 


Combined with the fact that = Op{l), this implies that i = Op{ljVnh) and so is asymptoti¬ 

cally negligible. 


6.3. Convergence rate of Rn ,2 

In this section we will derive the convergence rate of 


Rn,2 = gha{A)Kh^ 


Z;vr,/1) - C(Z;7r, 


which will depend on how well the nuisance functions vr and /r are estimated. 

In the previous subsection we showed that = Op{l) using conditions (b), (c), and (d) of Theo¬ 

rem 3, along with Assumption 2 (Positivity). Therefore we will consider the term P[g/ia(A)iT/ja(A){^(Z; vf, fi) — 
^(Z; 7f, p)}], which is a vector with element {j = 1, 2) equal to 

J^9ha,jit)Khait) IP|l(Z;7f,/l) - ^{Z;W,Jl) \ A = dt, 

where gha,j{t) = {(f — a)/hy~^ as before. Note that 


P{^(Z; 7f, fi) - C(Z; vr, /r) I A = f} = 
{^(L,f) -/i(L,f)| 


y-F(L,A) 
7f(A I 'L)/'dj{A) 

TT{t I Ij)/w{t) 


A = f > -(- rh{t) — 0{t) 


w{t) 

w{t) 




7f(t I L)/ro(f) 

7r{t I L) — 7r{t I L 


-|- m{t) — 6{t) 


7r{t I L) 


+ P|7f(f I L) - 7r(f I L)|p|^(L,f) - /1(LA)} 

+ I L)} + (P„ - P){/i(L, t)}. 

TU [tj 


The first equality above follows since E{^(Z; if,]!) \ A = t} = 9{t) because either ^ = vr or /I = ^ (as shown 
in Section 3), the second by iterated expectation and the fact that p{l \ a) = {7r(a | 1) /'w{a)'\p{\), and the third 
by rearranging terms and the definitions = Pn{7r(f | L)} and rh{t) = Pn{/i(L, t)}. 

Therefore using the Cauchy-Schwarz inequality (P(/p) < 11/| | 11^11), the triangle inequality. Assumption 2 
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(Positivity), and the uniform boundedness assumed in condition (e), we have 


IP 9ha,jiA)Kha{A)^i{Z]n,fi) - ^(Z;7r,/r)| 


= Op( 

J 

9ha,j{t)Kha{t) ||vr(f 1 L) - 7r(f | L)|| ||/i(L,f) - //(L,t)|| dt 
A 

+ 

(Pn - P) / 9ha,j{t)Kha{t) 7r(f | L) dt 

Ja 


+ 

(Pn - P) / 9ha,j{t)Kha{t) A(L, f) dt 

Ja 

\ 


The last two terms above can be controlled by Lemma 2 in this Appendix. Specifically, this lemma can be 
applied since its condition (b) corresponds exactly to condition (b) of Theorem 2, and since its conditions (d) 
and (e) are implied by condition (e) of Theorem 2. Therefore since | Ivr — = Op(l) and 11/) — p| = Op(l) 

by definition, the stochastic equicontinuity result of Lemma 2 implies that 

(Pn J 9 ha,jit)Khait)\^TT{t \ L) - Tf{t | L)| dt = Op{l/y/n), 

and similarly replacing vr with //. Therefore by the central limit theorem we have 


(Pn-P) f gha,jit)Khait) IL) dt = Op{l/^/n), 

Ja 

and similarly replacing vr with fx. Thus the last two terms in the inequality on the previous page are asymptoti¬ 
cally negligible up to order VnE since 

Xn = Op{l/^) ^/nXn = Op{l) V^Xn = Op{l)Op{l) = Op{l). 


Therefore since Op{op{l/\Enh)) = Op{\/\pnh'), we have 

gha,jiA)KhaiA)^iiZ;Tr,fi) 


= o. 


lA 


gha,j{t)Kha{t) dt 


+ OpiXj^fnh) 


where (/)^(f) = ||7r(f | L) - 7r(f | L)|| and = \\fi(L,t) - ^(L,f)||. 


Now let ||A:||[_i^i] = Kmax- Since K{u) < Kmaxl{\u\ < 1), we have 


lA 




lA 


t — a\^ ^ 1 f t — a 


<j)n{t)4)pi{t) dt 


< K, 


—2 -^^max 


sup (l>Tr{t) > < sup (l>f_,{t) > / |uP ^ du. 

t:\t—a\<h \ \ t:\t—a\<h \ J— 1 


t:\t—a\<h 


In the main text we define r„(a) and Sn{a) so that supj.|i_Q|<;j </> 7 r(f) = Op{rn{a)) and = 

Op{sn{a)). Therefore 


9ha,j{A)Kha{A)\^{Z;TT,fl) -^(Z;7r,/i) \\ \ = Op[rn{a)sn{a)). 


Combining the above with the results from subsections 6.1 and 6.2 yields the desired rate from the statement 
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of Theorem 2, 


Oh{a) - 9{a) 


= Or 


\\/nh 


+ + rn(a)sn(a' 


7. Proof of Theorem 3 

As in Theorem 2, we again use the decomposition 

9h(a) - 0{a) = - 6'(a)} + |^/i(a) - = \^9h{a) - 6'(a)| + {Rn,i + Rn,2) 


where 0ft(a) = g,ja(a)'^D^^^P„{g,j„(A)A:,ja(A)|(Z; vr,/i)} is ourproposedestimator, 0/,(a) = gha{ay'Dhl^n{gha{A)Khai 
is the infeasible estimator with known nuisance functions, t)ha = ^n{gha{A)KfiaiA)gha{^V}, and 


Rn,l = gha{art)-^{Fn-F) [gUA)Kha{A) [i{Z; TT, fi) - ^{Z-,7f,]l)} 


Rn,2 = g/ia(a)^DftaIP gha{A)Khc 


Z;vr,/i) - C(Z;7r,/r; 


We consider each term separately, as in the proof of Theorem 2. 


Fan et al. 


( |1994| , 


7. 1. Asymptotic normality of Oh [a) - 6{a) 

After scaling, the first term 9h{a) — 9{a) above is asymptotically normal by Theorem 1 from 
since 9h{a) is a standard local linear kernel estimator with outcome ^(Z; vf,p) and regressor A, and since 
E{^(Z;7f, p) I A = a} = 0{a) by condition (a) (i.e., either vf = tt or ft = p) as shown in Section 3 of this 
Appendix. Similar proofs for the asymptotic normality of local linear kernel estimators can be found elsewhere 
as well ( |Fan[ |1992| Fan et al. 1995| Masry and Fan[ 1997] Li and Racine[ 2001). Specifically, under conditions 
(b), (c), and (d) of Theorem 3 stated in the main text, the proof given by Fan et al. (19941 shows that, for 
hh{a) = 9"{a){h^/2) f v?K{u) du, we have 


nh^9h{a) - 9{a) - 6/i(a)| A- N fo, 


a' 


'{a) f K(u)^ du 


m{a) 


where, using the fact that E{|(Z; vr, ft) | A = a} = 9{a) and rearranging, 
(T^(a) = var{^(Z;7", ft) | A = a} 


= E 


= E 


= E 


= E 


?(Z; TT, ft) - E{^(Z; TT, ft) I A = a} 
Y - ^ 


7r(A I L)/tr7(A) 

y-ft(L,A) 

If {A I L)/ro(A) J 


A = 


A = a 


A = a 


— {0(a) — m(o)}^ 


r^(L, a) + {ft(L, a) - ft(L, g)}^ 
{Tf{a I L)/w(a)}2/{7r(a | L)/tz7(a)} 


— |0(a) — m(o)| . 


7. 2. Asymptotic negligibility of Rn,i 

We showed Rn^i = Op{l/Vnh) in the proof of Theorem 2 in Section 6.2. 
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7.3. Asymptotic negligibility of Rn ,2 


In the proof of Theorem 2 in Section 6.3 of this Appendix, we showed that Rn ,2 = Op{rn{a)sn{a)), where 
r„(a) and Sn{a) are the local rates of convergence for the nuisance estimators if and fl, as defined in the main 
text. By condition (f) of Theorem 3, we have rn{a)sn{a) = Op{l/^f^) so that R^fl = Op{op{l/y/r^)) = 
Op{l/^fr^), and thus Rn^ is asymptotically negligible up to order 

Therefore the proposed estimator 9h{a) is asymptotically equivalent to the infeasihle estimator 9h{a). This 
yields the result from Theorem 2 in the main text. 


8. Uniform consistency 


In this section we sketch some conditions under which our estimator is not only consistent pointwise hut also 
uniformly in the sense that supQgy^ \ 9h{o) — 9{a)\ = Op(l), and give a rate of convergence. However we leave 
a full treatment of this result to future work, in which we will also explore weak convergence of 9h{a) to some 
Gaussian process. This will he useful for testing and inference. 


We use the same decomposition as in Sections 6-7 proving Theorems 2-3, 

9h{a) - 9{a) = ^9h{a) - 6'(a)| -f -f Rn, 2 {a) 


with i(a) = Rn,i and Rn, 2 {o) = 72^,2 defined as before. From Masry (1996) and 
others), under standard smoothness/bandwidth conditions we have 


Hansen 


(20081 (among 


sup \9h{a) - 9{a)\ 
a&A 



Further, if the empirical process Vn{a) = y/ntij log nRn,i (a) is stochastically equicontinuous, then since 
y/nh\Rn,i{a)\ = Op(l) for any a G ^ we have 


sup \Rn,i{a)\ 
aeA 


Op (^y/logn/nh^ 


and so is asymptotically negligible. Finally the same logic as in Section 6.3 yields 


sup \Rn, 2 {a)\ = Op ( sup ||^(a | L) - 7r(a | L)|| • ||/t(L,a) - /i(L,a)|| , 

a&A VaSA / 

SO that for sup„g _4 ||7f(a | L) — 7r(a | L)|| = Op{r^) and similarly for jl and s* we have 


sup \9h{a) - 9{a)\ 

a&A 


Or. 
















9. Sample R code 


Continuous treatment effects 


### INPUT: 1 is an n*p matrix, a and y are vectors of length n 

### 1 = matrix of covariates 

### a = vector of treatment values 

### y = vector of observed outcomes 

# set up evaluation points & matrices for predictions 
a.min <- min(a); a.max <- max(a) 

a.vals <- seq(a.min,a.max,length.out=100) 

la.new <- rbind(cbind(1,a), cbind( 1[rep(1:n,length(a.vals)),], 
a=rep(a.vals,rep(n,length(a.vals))) )) 

1.new <- la.new[,“dim(la.new)[2]] 

# fit super learner (other methods could be used here instead) 
si.lib <- c("SL.earth","SL.gam","SL.gbm","SL.glm","SL.glmnet") 
pimod <- SuperLearner(Y=a, X=l, SL.library=sl.lib, newX=l.new) 
pimod.vals <- pimod$SL.predict; sq.res <- (a-pimod.vals)"2 
pi2mod <- SuperLearner(Y=sq.res,X=l, SL.library=sl.lib, newX=l.new) 
pi2mod.vals <- pi2mod$SL.predict 

mumod <- SuperLearner(Y=y, X=cbind(1,a), SL.library=sl.lib, 
newX=la.new,family=binomial); muhat.vals <- mumod$SL.predict 

# construct estimated pi/varpi and mu/m values 

approx.fn <- function(x,y,z){ predict(smooth.spline(x,y),x=x2)$y } 
a.std <- (la.new$a-pimod.vals)/sqrt(pi2mod.vals) 

pihat.vals <- approx.fn(density(a.std[1:n])$x, density(a.std[1:n])$y, 
a.std); pihat <- pihat.vals[1:n] 

pihat.mat <- matrix(pihat.vals[-(1:n)], nrow=n,ncol=length(a.vals)) 
varpihat <- approx.fn(a.vals, apply(pihat.mat,2,mean), a) 
varpihat.mat <- matrix(rep(apply(pihat.mat,2,mean),n), byrow=T,nrow=n) 
muhat <- muhat.vals[1:n] 

muhat.mat <- matrix(muhat.vals[-(1:n)], nrow=n,ncol=length(a.vals)) 

mhat <- approx.fn(a.vals, apply(muhat.mat,2,mean), a) 

mhat.mat <- matrix( rep(apply(muhat.mat,2,mean),n), byrow=T,nrow=n) 

# form adjusted/pseudo outcome xi 

pseudo.out <- (y-muhat)/(pihat/varpihat) + mhat 

# leave-one-out cross-validation to select bandwidth 
library(KernSmooth); kern <- function (x) { dnorm(x) } 

w.fn <- function(bw){ w.avals <- NULL; for (a.val in a.vals){ 
a.std <- (a-a.val)/bw; kern.std <- kern(a.std)/bw 
w.avals <- c(w.avals, mean (a. std''2*kern. std) * (kern (0)/bw) / 

(mean(kern.std)*mean(a.std"2*kern.std)-mean(a.std*kern.std )~ 2 )) 

}; return(w.avals/n) } 

hatvals <- function(bw){ approx(a.vals,w.fn(bw),xout=a)$y } 
cts.eff <- function(out,bw){ approx(locpoly(a,out,bw),xout=a)$y } 

# note: choice of bandwidth range depends on specific problem 
h.opt <- optimize( function(h){ hats <- hatvals(h); 

mean( ((pseudo.out - cts.eff(pseudo.out,bw=h))/(1-hats))^2) }, 

c(0.01,50), tol=0.01)$minimum 

# estimate effect curve with optimal bandwidth 

est <- approx(locpoly(a,pseudo.out,bandwidth=h.opt),xout=a.vals)$y 
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# estimate sandwich-style pointwise confidence band 
se <- NULL; for (a.val in a.vals){ 

a.std <- (a-a.val)/h.opt; kern.std <- (kern(a.std)/h.opt)/h.opt 
beta <- coef(Im(pseudo.out ~ a.std, weights=kern.std)) 

Dh <- matrix( c(mean(kern.std), mean(kern.std*a.std), 
mean(kern.std*a.std), mean(kern.std*a.std"2)), nrow=2) 
kern.mat <- matrix(rep(kern((a.vals-a.val)/h)/h,n), byrow=T,nrow=n) 
g2 <- matrix( rep((a.vals-a.val)/h, n), byrow=T, nrow=n) 
intfnl.mat <- kern.mat*(muhat.mat - mhat.mat)*varpihat.mat 
intfn2.mat <- g2*kern.mat*(muhat.mat - mhat.mat)*varpihat.mat 
inti <- apply(matrix(rep((a.vals[-l]-a.vals[-length(a.vals)])/2,n), 
byrow=T,nrow=n)*intfnl.mat[,-1]+intfnl.mat[,-length(a.vals)],1,sum) 
int2 <- apply(matrix(rep((a.vals[-l]-a.vals[-length(a.vals) ])/2,n) , 
byrow=T,nrow=n)*intfn2.mat[,-1]+intfn2.mat[,-length(a.vals)],1,sum) 
sigma <- cov(t(solve(Dh) %*% 

rbind( wt*(out-beta[1]-beta[2]*a.std) + inti, 
a.std*wt*(out-beta[1]-beta[2]*a.std) + int2 ))) 
se <- c(se, sqrt(sigma[1,1]) ) } 

ci.ll <- est-1.96*se/sqrt(n); ci.ul <- est+1.96*se/sqrt(n) 



